
Formula of the loss function
The matrix format is:

Assume that each

The likelihood for linear regression will be
or using the log likelihood,
so we get

the optimization point will be on the blue curve between the


The

Example:
If we have
Then
or
in practice, we choose the location
given training data (x, y) and some
so the parameters




Given a dataset
Computing posterior distribution is known as the inference problem. But:
This integral can be very high-dimensional and difficult to compute.
Model parameters:

Given the parameters
Lecture 7. Bayesian Learning — ML Engineering (ml-course.github.io)
Gaussian processes (3/3) - exploring kernels (peterroelants.github.io)
Understanding the training data
With a Gaussian likelihood,
here is the calculation:
in the above steps we get
The posterior
when
if
Given the condition, the likelihood
Theorem 4.4.1 (Bayes rule for linear Gaussian systems). Given a linear Gaussian system, as in Equation 4.124, the posterior
posterior distribution
after learning, the prediction function (a convolution of two Gaussians)
Besides the above proof, also refer to chapter 4.4.3 (KPM), which utilizes the
where the precision matrix of the joint is defined as
From Equation 4.69 and using the fact that
Predict the probability that a point belongs to a certain class, using Bayes’ Theorem
Problem: since x is a vector, computing
Naively assume that all features are conditionally independent of each other, in which case:
Very fast: only needs to extract statistics from each feature.
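The conditional-independence trick above can be sketched in code. This is a minimal Gaussian naive Bayes classifier, not any particular library's implementation; the function names are my own, and the "statistics extraction" is just per-class means and variances.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Extract per-class priors and per-feature Gaussian statistics."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(x, classes, priors, means, variances):
    # log p(c | x) ∝ log p(c) + sum_j log N(x_j | mu_cj, var_cj)
    log_post = (np.log(priors)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    return classes[np.argmax(log_post)]
```

Fitting only touches each feature once per class, which is why the method is fast.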

Posterior mean:
The given training data
The ML predictor believes there is only one true
Bayesian Hyperparameter Optimization using Gaussian Processes | Brendan Hasz
How to Implement Bayesian Optimization from Scratch in Python - MachineLearningMastery.com
Tutorial #8: Bayesian optimization - Borealis AI
Bayesian Optimization is used when there is no explicit objective function and it’s expensive to evaluate the objective function.
As shown in the next figure, a GP is used along with an acquisition (utility) function to choose the next point to sample, where it’s more likely to find the maximum value in an unknown objective function.
A GP is constructed from the points already sampled and the next point is sampled from the region where the GP posterior has higher mean (to exploit) and larger variance (to explore), which is determined by the maximum value of the acquisition function (which is a function of GP posterior mean and variance).
To choose the next point to be sampled, the above process is repeated.
The next couple of figures show the basic concepts of Bayesian optimization using GP, the algorithm, how it works, along with a few popular acquisition functions.


Bayesian Optimization Algorithm:
For t=1,2,… do
Find
Sample the objective function
Augment the data
End for
Basically, we should choose the next point x where:
the mean is high (exploitation)
the variance is high (exploration)
The acquisition function
The acquisition function takes in the obtained GP fit (mean and variance at each point) and returns the hyper-parameter value for the next run of the machine learning algorithm. For more details on the derivation refer to this article. More kernel visualizations can be found here.
The
a is action,
At iteration
we don't have the true objective , so we get expected improvement function as following:
where

x
import numpy as np
import scipy.stats as scs

def generate_prediction():
    # Run Bayesian (GP) regression on the parameters and losses seen so far
    list_of_parameters_stack, list_of_parameters = normalize_current_values()
    loss_evaluated = self.parameters_and_loss_dict['loss']
    if len(self.parameters_and_loss_dict['loss']) > 1:
        # standardize the observed losses before fitting the GP
        loss_evaluated = ((self.parameters_and_loss_dict['loss']
                           - np.mean(self.parameters_and_loss_dict['loss']))
                          / (np.std(self.parameters_and_loss_dict['loss']) + 1e-6))
    loss_predicted, sigma = fit_gp(list_of_parameters_stack, list_of_parameters, loss_evaluated)
    return loss_predicted, sigma, loss_evaluated

def calculate_expected_improvement(loss_predicted, sigma, loss_evaluated):
    # Calculate the expected improvement
    eps = 1e-6
    num = loss_predicted - max(loss_evaluated) - eps
    Z = num / sigma
    expected_improvement = num * scs.norm(0, 1).cdf(Z) + sigma * scs.norm(0, 1).pdf(Z)
    expected_improvement[sigma == 0.0] = 0.0
    return expected_improvement

# Generate function prediction from the parameters and the loss (run Bayesian regression)
loss_predicted, sigma, loss_evaluated = generate_prediction()
# Calculate expected improvement (finding the maximum of the information gain function)
expected_improvement = calculate_expected_improvement(loss_predicted, sigma, loss_evaluated)
Define the regret and cumulative regret as follows:
The GP-UCB criterion is as follows:
Beta is set using a simple concentration bound:
With
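The GP-UCB criterion above can be sketched as follows. The exact form of the concentration bound for beta varies across papers; the constant used here (a Srinivas-et-al-style bound with failure probability delta and domain size d) is an assumption for illustration.

```python
import numpy as np

def gp_ucb(mu, sigma, t, delta=0.1, d=1):
    """GP-UCB acquisition: mu(x) + sqrt(beta_t) * sigma(x).
    beta_t grows slowly with the iteration t via a simple concentration
    bound (assumed form; constants differ across references)."""
    beta_t = 2 * np.log((t ** 2) * (np.pi ** 2) * d / (6 * delta))
    return mu + np.sqrt(beta_t) * sigma
```

A point with a slightly lower mean but much larger variance can win the acquisition, which is exactly the exploration behaviour the criterion is designed for.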



Bayesian approaches, in contrast to random or grid search, keep track of past evaluation results which they use to form a probabilistic model mapping hyperparameters to a probability of a score on the objective function:
In the literature, this model is called a “surrogate” for the objective function and is represented as p(y | x). The surrogate is much easier to optimize than the objective function and Bayesian methods work by finding the next set of hyperparameters to evaluate on the actual objective function by selecting hyperparameters that perform best on the surrogate function. In other words:
Build a surrogate probability model of the objective function
Find the hyperparameters that perform best on the surrogate -- have one best guess from the acquisition function.
Apply these hyperparameters to the true objective function -- get the new loss data from loss function. (scores)
Update the surrogate model incorporating the new results --- put the x (hyperparameters) and y (loss data from true object function) to fit/update the model
Repeat steps 2–4 until max iterations or time is reached
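Steps 1–4 above can be sketched end-to-end. This is a minimal 1-D Bayesian optimization loop with a tiny numpy GP surrogate and expected improvement; the kernel, jitter, and grid-based acquisition maximization are simplifying assumptions, not how a production library would do it.

```python
import numpy as np
from scipy.stats import norm

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # Step 1: build the surrogate (GP posterior) from the history (X, y)
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks, Kss = rbf(X, Xs), rbf(Xs, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - np.sum(v ** 2, axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best, eps=1e-6):
    z = (mu - best - eps) / sigma
    return (mu - best - eps) * norm.cdf(z) + sigma * norm.pdf(z)

def bayes_opt(f, bounds, n_init=3, n_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(bounds[0], bounds[1], n_init)   # initial random evaluations
    y = f(X)
    grid = np.linspace(bounds[0], bounds[1], 200)
    for _ in range(n_iter):
        mu, sigma = gp_posterior(X, y, grid)
        # Step 2: best guess from the acquisition function
        x_next = grid[np.argmax(expected_improvement(mu, sigma, y.max()))]
        # Steps 3-4: evaluate the true objective and augment the history
        X = np.append(X, x_next)
        y = np.append(y, f(np.array([x_next])))
    return X[np.argmax(y)], y.max()
```

Maximizing a known toy objective such as f(x) = -(x - 2)^2 shows the loop homing in on the optimum with only ~18 objective calls.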
At a high-level, Bayesian optimization methods are efficient because they choose the next hyperparameters in an informed manner**.** The basic idea is: spend a little more time selecting the next hyperparameters in order to make fewer calls to the objective function.
Sequential model-based optimization (SMBO) methods are a formalization of Bayesian optimization. The sequential refers to running trials one after another, each time trying better hyperparameters by applying Bayesian reasoning and updating a probability model (surrogate).
There are five aspects of model-based hyperparameter optimization:
A domain of hyperparameters over which to search
An objective function which takes in hyperparameters and outputs a score that we want to minimize (or maximize)
The surrogate model of the objective function
A criterion, called a selection function, for evaluating which hyperparameters to choose next from the surrogate model
A history consisting of (score, hyperparameter) pairs used by the algorithm to update the surrogate model
There are several variants of SMBO methods that differ in steps 3–4, namely, how they build a surrogate of the objective function and the criteria used to select the next hyperparameters. Several common choices for the surrogate model are Gaussian Processes, Random Forest Regressions, and Tree Parzen Estimators (TPE) while the most common choice for step 4 is Expected Improvement. In this post, we will focus on TPE and Expected Improvement.
In the case of random search and grid search, the domain of hyperparameters we search is a grid.
The objective function takes in hyperparameters and outputs a single real-valued score that we want to minimize (or maximize) --- loss function or maximum likelihood function. As an example, let’s consider the case of building a random forest for a regression problem. The hyperparameters we want to optimize are shown in the hyperparameter grid above and the score to minimize is the Root Mean Squared Error.
While the objective function looks simple, it is very expensive to compute! If the objective function could be quickly calculated, then we could try every single possible hyperparameter combination (like in grid search). If we are using a simple model, a small hyperparameter grid, and a small dataset, then this might be the best way to go. However, in cases where the objective function may take hours or even days to evaluate, we want to limit calls to it.
The entire concept of Bayesian model-based optimization is to reduce the number of times the objective function needs to be run by choosing only the most promising set of hyperparameters to evaluate based on previous calls to the evaluation function. The next set of hyperparameters are selected based on a model of the objective function called a surrogate.
The surrogate function, also called the response surface, is the probability representation of the objective function built using previous evaluations. It is called a response surface because it is a high-dimensional mapping of hyperparameters to the probability of a score on the objective function. Below is a simple example with only two hyperparameters:

There are several different forms of the surrogate function including Gaussian Processes and Random Forest regression. However, in this post we will focus on the Tree-structured Parzen Estimator as put forward by Bergstra et al in the paper “Algorithms for Hyper-Parameter Optimization”.
Please refer to the acquisition function section above.
The selection function is the criteria by which the next set of hyperparameters are chosen from the surrogate function.
Moreover, because the surrogate is just an estimate of the objective function, the selected hyperparameters may not actually yield an improvement when evaluated and the surrogate model will have to be updated. This updating is done based on the current surrogate model and the history of objective function evaluations.
Each time the algorithm proposes a new set of candidate hyperparameters, it evaluates them with the actual objective function and records the result in a pair (score, hyperparameters). These records form the history. The algorithm builds "Surrogate function" using the history to come up with a probability model of the objective function that improves with each iteration.
consider how to deal with noisy observations, how to choose a kernel, how to learn the parameters of that kernel, how to exploit parallel sampling of the function.
In the noisy GP model, we no longer observe the function values
and the conditional probability of a new point becomes:
where
Incorporating noise means that there is uncertainty about the function even where we have already sampled points, and so sampling twice at the same position or at very similar positions could be sensible.

When we build the model of the function and its uncertainty, we are assuming that the function is smooth. If this was not the case, then we could say nothing at all about the function between the sampled points. The details of this smoothness assumption are embodied in the choice of kernel covariance function.
We can visualize the covariance function by drawing samples from the Gaussian process prior. In one dimension, we do this by defining an evenly spaced set of points
Squared Exponential Kernel: include the amplitude
where
Matérn kernel:
Periodic Kernel:
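The kernel equations above were lost in extraction, so here are the standard forms in code (amplitude/lengthscale parameterizations assumed; the Matérn shown is the simple ν = 1/2 member of the family):

```python
import numpy as np

def squared_exponential(x1, x2, amplitude=1.0, lengthscale=1.0):
    # k(x1, x2) = a^2 * exp(-(x1 - x2)^2 / (2 l^2))
    return amplitude ** 2 * np.exp(-0.5 * (x1 - x2) ** 2 / lengthscale ** 2)

def matern12(x1, x2, lengthscale=1.0):
    # Matern kernel with nu = 1/2: exp(-|x1 - x2| / l)
    return np.exp(-np.abs(x1 - x2) / lengthscale)

def periodic(x1, x2, lengthscale=1.0, period=1.0):
    # k(x1, x2) = exp(-2 sin^2(pi |x1 - x2| / p) / l^2)
    return np.exp(-2 * np.sin(np.pi * np.abs(x1 - x2) / period) ** 2
                  / lengthscale ** 2)
```

Note the periodic kernel returns to its maximum whenever the two inputs differ by a multiple of the period, which is what lets a GP model repeating structure.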
1. Maximum likelihood: similar to training ML models, we can choose these parameters by maximizing the marginal likelihood (i.e., the likelihood of the data after marginalizing over the possible values of the function):
where
In Bayesian optimization, we are collecting the observations sequentially, and where we collect them will depend on the kernel parameters, and we would have to interleave the processes of acquiring new points and optimizing the kernel parameters.
2. Full Bayesian approach: here we would choose a prior distribution
In practice this would usually be done using a Monte Carlo approach in which the posterior is represented by a set of samples (see Snoek et al., 2012) and we sum together multiple acquisition functions derived from these kernel parameter samples (figure 9).

Figure: Bayesian approach for handling length scale of kernel from Snoek et al., 2012. a-c) We fit the model with three different length scales and compute the acquisition function for each. d) We compute a weighted sum of these acquisition functions (black curve) where the weight is given by posterior probability of the data at each scale (see equation 22). We choose the next point by finding the maximum of this weighted function (black arrow). In this way, we approximately marginalize out the length scale.
cost function
Use gradient descent, or solve directly by setting the gradient to 0 (when it equals 0 we have reached the minimum):
Non-matrix (scalar) solution:
Matrix solution:
The gradient vector of
, , the
Batch
Online
Mini-batch
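The batch/online(SGD)/closed-form options above can be sketched for least squares. This is an illustrative implementation with my own function names; the learning rates are assumptions chosen for a well-conditioned toy problem.

```python
import numpy as np

def batch_gd(X, y, lr=0.1, n_epochs=500):
    """Batch gradient descent: uses all samples for each gradient step."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_epochs):
        grad = X.T @ (X @ w - y) / n   # gradient of (1/2n) ||Xw - y||^2
        w -= lr * grad
    return w

def sgd(X, y, lr=0.01, n_epochs=50, seed=0):
    """Online / stochastic gradient descent: one sample per step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

def normal_equation(X, y):
    """The matrix solution: set the gradient to zero, solve X^T X w = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```

All three recover essentially the same weights on well-behaved data; the closed form is exact, while the iterative variants trade per-step cost against the number of passes.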
The most common optimization algorithms in machine learning are gradient-based methods, applicable when the objective is a stochastic function F(θ) with a structure like the following:

To optimize this kind of objective, the core computational problem is estimating the gradient of the stochastic function F(θ), i.e.:

Stochastic gradient estimation is a core computational problem in machine learning and many other fields, for example: variational inference, a common approximate Bayesian inference method; policy-gradient algorithms in reinforcement learning; and Bayesian optimization and active learning in experimental design. For expectation-type objectives, the most common approach is based on Monte Carlo sampling.
Monte Carlo Sampling (MCS)
MCS is a classic method for approximating integrals; the problem in equation (1) can usually be approximately solved with MCS as follows:

where
estimates:
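The MCS estimator above (average of f over samples drawn from the distribution) can be sketched generically; the helper names here are my own:

```python
import numpy as np

def mc_estimate(f, sampler, n_samples, seed=0):
    """Monte Carlo estimate of E[f(x)]: (1/N) * sum_i f(x_i), x_i ~ p."""
    rng = np.random.default_rng(seed)
    x = sampler(rng, n_samples)
    return np.mean(f(x))
```

For example, estimating E[x^2] under a standard normal recovers the known value 1, with error shrinking like 1/sqrt(N).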


https://gaussianprocess.org/gpml/
https://zhuanlan.zhihu.com/p/31427491
https://www.cnblogs.com/tornadomeet/archive/2013/06/15/3137239.html
https://sandipanweb.wordpress.com/2020/12/08/gaussian-process-regression-with-python/
In Python, sampling a normal distribution looks like: x = sigma * np.random.randn(10) + mu
To sample from a general distribution, generate random numbers from the uniform distribution and map them through the inverse of the cumulative distribution function (CDF) to get the samples.
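The uniform-to-CDF mapping just described is inverse transform sampling. A sketch for the exponential distribution, where the inverse CDF has a closed form:

```python
import numpy as np

def inverse_transform_exponential(n, lam=1.0, seed=0):
    """Sample Exp(lam) by mapping uniform draws through the inverse CDF.
    CDF: F(x) = 1 - exp(-lam * x)  =>  F^{-1}(u) = -ln(1 - u) / lam."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)
    return -np.log(1.0 - u) / lam
```

The sample mean should approach 1/lam, confirming the draws follow the target distribution.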

In the case of two independent univariate Gaussian variables,
their joint distribution is
Theorem 4.3.1 (marginal and conditionals of an MVN), suppose
then the marginals are given by
and the posterior conditional distribution is given by (refer to KPM for the details of the following equations)
See KPM chapter 4.3.4.3 for the derivation. Equation 4.69 is also used in the proof of Theorem 4.4.1 (Bayes rule for linear Gaussian systems).
That is, based on the joint distribution
Breaking the term "Gaussian process" down literally, we can see it contains two parts:
- Gaussian, referring to the Gaussian distribution
- Process, referring to a stochastic process

First, when the random variable is 1-dimensional, we call it a univariate Gaussian distribution, with probability density function
For a continuous domain T (suppose it is a time axis), if we pick any n points on the continuous domain, t1, t2, t3, ..., tn ∈ T, we obtain an n-dimensional vector
A p-dimensional Gaussian distribution is determined by two parameters: a p-dimensional mean vector μp, giving the expectation of each dimension, and a p×p covariance matrix Σp×p, giving the variance of each dimension as well as the covariances between different dimensions.
A Gaussian process defined on a continuous domain T is analogous: it is an infinite-dimensional Gaussian distribution. It likewise needs a mean at every time point t, but a vector no longer suffices, because the domain is continuous and the dimension is infinite, so the mean is defined as a function of the time t:
The covariance matrix is treated the same way: in the infinite-dimensional case it is defined as a kernel function
Here σ and l are hyperparameters of the radial basis function, which we can set in advance, e.g. σ=1, l=1. From this expression we can read off the idea:
and t denote two different time points on the continuous domain of the Gaussian process,
Thus we have described the two core elements of a Gaussian process: the mean function and the kernel function. By the existence theorem for Gaussian processes, once these two elements are determined, the whole Gaussian process is determined:
The random process is the
means the multiple random
and

Cholesky Decomposition
Theorem: If
So sampling:
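The Cholesky-based sampling just mentioned can be sketched as follows: if K = L L^T and z ~ N(0, I), then L z ~ N(0, K). The RBF kernel and the small jitter added for numerical stability are standard choices, assumed here.

```python
import numpy as np

def sample_gp_prior(xs, n_samples=3, lengthscale=1.0, jitter=1e-6, seed=0):
    """Draw samples from a GP prior with an RBF kernel via the Cholesky
    factor of the covariance matrix."""
    rng = np.random.default_rng(seed)
    # RBF covariance between all pairs of input points
    K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / lengthscale ** 2)
    # jitter keeps the near-singular K numerically positive definite
    L = np.linalg.cholesky(K + jitter * np.eye(len(xs)))
    return L @ rng.standard_normal((len(xs), n_samples))
```

Each column of the result is one smooth random function evaluated on the grid, which is exactly the "visualize the covariance function by drawing samples" idea above.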
kernel function https://peterroelants.github.io/posts/gaussian-process-kernels/
while given new

A valid covariance matrix is a (symmetric) positive semi-definite matrix.
Matrix A is positive definite if, for any X ≠ 0, we always have
Matrix A is positive semi-definite if, for any X ≠ 0, we always have
A necessary and sufficient condition for A to be positive semi-definite is that all of its principal minors are greater than or equal to zero.
If you know SVM (or functional analysis), you will have encountered the famous Mercer Theorem, which says: a matrix is a positive semi-definite matrix if and only if it is a Mercer kernel.
So this is a Bayesian method: unlike OLS regression, it gives the entire (posterior) probability distribution of the predicted value. To emphasize: we obtain the whole distribution of f*, not a point estimate!

GP posterior general (Bayesian)
Given a test set
Where
here we could use a squared exponential kernel, aka Gaussian kernel or RBF kernel. In 1d, this is given by
The actual algorithm for this computation:
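A sketch of that computation: GP regression posterior mean and variance via a Cholesky factorization, in the style of the standard algorithm in Rasmussen & Williams (Algorithm 2.1). The 1-d SE kernel and noise level are assumptions for the example.

```python
import numpy as np

def gp_regression(X, y, X_star, lengthscale=1.0, noise=0.1):
    """Posterior mean and variance of a GP with an SE kernel at test
    points X_star, given noisy training data (X, y)."""
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)
    K = k(X, X) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    K_star = k(X, X_star)
    mean = K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    var = np.diag(k(X_star, X_star)) - np.sum(v ** 2, axis=0)
    return mean, var
```

Fitting noisy samples of sin(x) and querying between training points gives a mean close to the true function and a small posterior variance there, exactly the full-distribution prediction emphasized above.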

We can extend the SE kernel to multiple dimensions as follows:
Refer to KPM, chapter 15.2.4 Estimating the kernel parameters
Refer to Maximum Likelihood Estimation of Gaussian Parameters (jrmeyer.github.io)
Learning the kernel parameters (maximum (marginal) likelihood, Bayesian, cross validation)
include the Noisy parameters
where
. It takes time to compute , and then time per hyperparameter to compute the gradient.
here
Optimize it Using GD to get the
The form of
depends on the form of the kernel, and which parameter we are taking derivatives with respect to. Often we have constraints on the hyper-parameters, such as In this case, we can define , and then use the chain rule. Given an expression for the log marginal likelihood and its derivative, we can estimate the kernel parameters using any standard gradient-based optimizer. However, since the objective is not convex, local minima can be a problem.
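The marginal-likelihood objective described above can be sketched directly. This computes log p(y | X, θ) for a 1-d SE kernel and picks the best lengthscale from a grid; a real implementation would use the gradient expressions above with a gradient-based optimizer over log-parameters (the chain-rule trick mentioned), so the grid search here is a simplification.

```python
import numpy as np

def log_marginal_likelihood(X, y, lengthscale, noise=0.1):
    """log p(y|X) = -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2 pi)."""
    K = (np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / lengthscale ** 2)
         + noise ** 2 * np.eye(len(X)))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))        # = -1/2 log|K|
            - 0.5 * len(X) * np.log(2 * np.pi))

def fit_lengthscale(X, y, grid):
    """Type-II maximum likelihood by grid search over the lengthscale."""
    return grid[np.argmax([log_marginal_likelihood(X, y, l) for l in grid])]
```

On smooth data the marginal likelihood penalizes both a tiny lengthscale (overfitting, poor Occam factor) and a huge one (underfitting), so a moderate value wins.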
On another webpage (Optimizing Parameters (mlatcl.github.io)) (I also don't see any further steps based on those complex equations in the GPy code?):
The parameters are inside the covariance function (matrix)
Λ represents distance on axes. R gives rotation.
is diagonal, . Useful representation since
.
In the previous section, we assumed that we can compute the gradient exactly. However, as the dimension n of the vector y increases, it might not be possible to compute the above gradient at a reasonable time and cost. Let's analyze the computational complexity of each term. Learning Gaussian Process Covariances (chrischoy.github.io)
make the Gradient of the Posterior Probability easier
the computational complexity from

Ridge regression :
so the computation can be done with
each
Maximum Likelihood Estimation of Gaussian Parameters (jrmeyer.github.io)
Theorem (Jensen's Inequality)
If
Moreover, if
Let
For any PMF
marginal log likelihood
To select the inducing input
The initial joint model
where the conditional prior is given by
Posterior distribution
Log Marginal likelihood
Approximate the true posterior distribution by introducing a variational distribution
To determine the variational quantities
Condition givens
Maximize the bound w.r.t. the distribution
The usual way of doing this is to take the derivative w.r.t.
A faster and far simpler way to compute the optimal bound is by reversing Jensen's inequality, moving the log outside of the integral in eq.
The optimal distribution
Definition 8.11 (KL divergence). For two distributions q(x) and p(x)
using approximate q(z) to estimate p(z|x)
In other words: given the data, we do not know the distribution of the latent variables. For example, in a GMM we do not know how many Gaussians there are or the parameters of each Gaussian, so we posit a guessed Gaussian distribution.
Conditional distribution:
so we get:

The goal is to maximize the likelihood given the training data
Partial derivative w.r.t.
Partial derivative w.r.t.
Entropy H is a measure of the uncertainty associated with a random variable. It is defined as:
Example (Bernoulli distribution): the entropy is
Maximizing the likelihood (MLE) means minimizing the KL divergence
Applying the expectation equation:
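As a quick check of the Bernoulli entropy example above (entropy in bits, with the usual convention H(0) = H(1) = 0):

```python
import numpy as np

def bernoulli_entropy(p):
    """H(p) = -p log2(p) - (1-p) log2(1-p); maximal at p = 0.5."""
    p = np.asarray(p, dtype=float)
    with np.errstate(divide='ignore', invalid='ignore'):
        h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    # 0 log 0 is taken to be 0 by convention
    return np.where((p == 0) | (p == 1), 0.0, h)
```

A fair coin carries exactly one bit of uncertainty, and the entropy falls off symmetrically as p moves toward 0 or 1.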



(A Unifying View of Sparse Approximate Gaussian Process Regression) Joaquin Quiñonero-Candela, Carl Edward Rasmussen, 2005
Probabilistic regression is usually formulated as follows: given a training set D = {(x_i, y_i), i = 1, ..., n} of n pairs of (vectorial) inputs x_i and noisy (real, scalar) outputs y_i, compute the predictive distribution of the function values
Gaussian process (GP) regression is a Bayesian approach which assumes a GP prior over functions, i.e. assumes a priori that function values behave according to
where f = [f_1, f_2, ..., f_n]^T is a vector of latent function values, f_i = f(x_i), and K is a covariance matrix whose entries are given by the covariance function, K_ij = k(x_i, x_j)
exact expression
Due to the consistency of Gaussian processes, we know that we can recover p(f*, f) by simply
integrating (marginalizing) out u from the joint GP prior p(f*, f, u)
The name inducing variable is motivated by the fact that f and f* can only communicate through u, and u therefore induces the dependencies between training and test cases. As we shall detail in the following sections, the different computationally efficient algorithms proposed in the literature correspond to different additional assumptions about the two approximate inducing conditionals q(f|u), q(f*|u) of the integral in (8).
as special (noise free) cases of the standard predictive equation (6), with u playing the role of (noise free) observations. Shorthand notation:
For any input x*, the corresponding function value f* is given by:
Projected Process Approximation (PPA) : the method as relying on a likelihood approximation, based on the projection
Reformulation to

While the DTC is based on the likelihood approximation given by (17), the SGPP proposes
a more sophisticated likelihood approximation with a richer covariance
where diag[A] is a diagonal matrix whose elements match the diagonal of A.

in 2009, Michalis Titsias published a paper that proposed a different approach: “Variational Learning of Inducing Variables in Sparse Gaussian Processes” (Titsias 2009). This method does not quite fit into the unifying view proposed by Quiñonero-Candela. The key idea is to construct a variational approximation to the posterior process, and learn the pseudo-points Z alongside the kernel parameters by maximising the evidence lower bound (ELBO), i.e. a lower bound on the log-marginal likelihood. There was quite a bit of prior art on variational inference in Gaussian processes (e.g. Csató and Opper 2002; Seeger 2003): Titsias’ important contribution was to treat Z as parameters of the variational approximation, rather than model parameters.
Instead Of doing
We will do
KERNEL FUNCTION:
At each step during online learning of the GP, the proposed algorithm is presented with a training pair that is used to update the existing model. Assuming that at a given time the model is given by
This paper provides a new principled framework for deploying Gaussian process probabilistic models
in the streaming setting.
The framework subsumes Csató and Opper’s two seminal approaches to online regression [8, 9] that were based upon the variational free energy (VFE) and expectation propagation (EP) approaches to approximate inference respectively. In the new framework, these algorithms are recovered as special cases.
also provide principled methods for learning hyperparameters (learning was not treated in the original work and the extension is non-trivial) and optimizing pseudo-input locations (previously handled via hand-crafted heuristics).
The approach also relates to the streaming variational Bayes framework [10
Given N input and real-valued output pairs
variational free energy approximation scheme which lower bounds the marginal likelihood of the data using a variational distribution q(f) over the latent function:
Since
divergence, maximising this lower bound with respect to q(f) guarantees the approximate posterior
gets closer to the exact posterior
the marginal likelihood and can be used for learning the hyperparameters

Consider an approximation to the true posterior at the previous step, qold(f), which must be updated
to form the new approximation qnew(f),
the new approximation cannot access
The form of the approximate posterior mirrors that in the batch case, that is, the previous approximate posterior,
where
The negative variational free energy is also analytically available,
In the Bayesian approach to statistical inference, the degrees of prior belief or plausibility of parameters are expressed within probability distributions, the so called prior distributions (or priors) p(θ). Once this idea is accepted, subsequent inferences can be based on Bayes rule of probability. Formally, we may think that data are generated by a two step process:
First, the true parameter θ is drawn at random from the prior distribution p(θ).
Second, the data are drawn at random from

In order to construct an online algorithm within the Bayesian framework, we have to find out how the posterior distribution changes when a new datapoint
does not have the form of an online algorithm, because it requires the knowledge of the entire old dataset Dt.
Update: Use the old approximative posterior
Project: The new posterior
Minimizing (6.3) can be thought of as minimizing the loss of information in the projection step. For the important case, where the parametric family is an exponential family, i.e. if the densities are of the form
If we use a general multivariate Gaussian distribution for
Using a simple property of centered Gaussian random variables z, namely the fact that for well behaved functions f, we have
Online Learning
Often, learning algorithms for estimating the unknown parameter θ∗ are based on the principle of Maximum Likelihood (ML). It states that we should choose a parameter θ which maximizes the likelihood P(Dt|θ) of the observed data. Under weak assumptions, ML estimators are asymptotically efficient. As a learning algorithm, one can use e.g. a gradient descent algorithm and iterate
Here,

where θ denotes the set of arguments of the densities. If
Definition 8.11 (KL divergence). For two distributions q(x) and p(x)
likelihood of a single new data point and the (GP) prior from the result of the previous approximation step. Let
To derive the updated posterior. Since post is no longer Gaussian, we use a variational technique in order to project it to the closest Gaussian process
In order to compute the on-line approximations of the mean and covariance kernel Kt we apply Lemma 1 sequentially with only one likelihood term
The averages in (7) are with respect to the Gaussian process at time t and the derivatives taken with respect to
The Sparse GP Algorithm
For each data element (yt+1, xt+1) we will iterate the following steps:
1. Compute
2. If
3. (else) Perform the update eq. (9) using the unit vector
4. If the size of the BV set is larger than d, then compute the scores
Suppose we have a training dataset
Conditional Probabilities

where p(y|f ) is the likelihood and p(f ) the GP prior. The data induce a posterior GP which is specified by a posterior mean function and a posterior covariance function:
To define a sparse method that directly approximates the posterior GP mean and covariance functions in eq. (1). This posterior GP can be also described by the predictive Gaussian
where
we equivalently write p(z|y) as
Suppose now that
Thus, we expect q(z) to be only an approximation to p(z|y). In such case, we can choose φ(fm) to be a “free” variational Gaussian distribution, where in general φ(fm) != p(fm|y), that depends on a mean vector μ and a covariance matrix A.
the approximate posterior GP mean and covariance functions as follows
The question that now arises is: how do we select the φ distribution, i.e. (μ, A), and the inducing inputs Xm?
Using

Next we describe a variational method that allows to jointly specify these quantities and treat Xm as a variational parameter which is rigorously selected by minimizing the KL divergence.
Sigmoid/logistic function


Logistic regression
, where
The cost function (NLL, the negative log-likelihood) – the cross-entropy error function
where
One can show that H is positive definite, hence the NLL is convex and has a unique global minimum. To find the minimum …
Gradient descent
Newton’s method
In WLS, the unknowns V and β arise together. The basic idea of Iteratively Reweighted Least Squares (IRLS) is that you update one given the other. The key is finding the right formula for
V that depends on β
β that depends on V
Then either one would be initialized and updated iteratively.
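The alternate-and-iterate scheme just described can be sketched for logistic regression (where IRLS coincides with Newton's method). The small ridge added to the Hessian is a numerical-stability assumption, not part of the classical derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_logistic(X, y, n_iter=20):
    """IRLS for logistic regression: the weights V depend on beta,
    and the beta update is a weighted least-squares solve given V."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(X @ beta)        # current predictions (given beta)
        V = mu * (1 - mu)             # weights V, a function of beta
        H = X.T @ (V[:, None] * X)    # Hessian of the NLL
        g = X.T @ (mu - y)            # gradient of the NLL
        # beta update given V (Newton step; tiny ridge for stability)
        beta -= np.linalg.solve(H + 1e-9 * np.eye(X.shape[1]), g)
    return beta
```

On data generated from a known logistic model, the iterates converge in a handful of steps to weights close to the generating ones, which is the fast quadratic convergence Newton's method is known for.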

Note: form the likelihood, take the negative log, compute the derivative to get the gradient, then follow the gradient (gradient descent) or follow the gradient weighted by the Hessian (Newton's method).
Posterior in Bayesian:
This integral cannot be done by hand; y is binary… so, with a Gaussian prior, we have

The Bayesian prediction function:
with normalized weights (w)
with un-normalized weights (w)
want
Predictive distribution,
we have
logistic function :
[The Bayesian view: the prediction is given by the likelihood, with each prediction weighted by how probable it is; the probability is measured according to the posterior distribution, which is the quantity that takes into account the prior information and the training data.]
With Monte Carlo method
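A sketch of that Monte Carlo approach: draw weight samples from a Gaussian approximation to the posterior (e.g. from a Laplace approximation, assumed here) and average the sigmoid predictions, implementing the "weighted by posterior probability" idea in the bracketed note.

```python
import numpy as np

def mc_predictive(x, w_mean, w_cov, n_samples=10000, seed=0):
    """Monte Carlo approximation of p(y=1 | x, D) =
    E_{w ~ posterior}[sigmoid(w^T x)], with a Gaussian posterior assumed."""
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(w_mean, w_cov, size=n_samples)
    return np.mean(1.0 / (1.0 + np.exp(-W @ x)))
```

Note the effect of posterior uncertainty: averaging over weight samples pulls the predictive probability toward 0.5 relative to the plug-in sigmoid at the posterior mean.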

belongs to a finite set
initial state
transition probabilities:
Markov property/assumption: "given current state, the past doesn't matter"



# exactly shows how the frequency of being in state j is calculated in code
state_space = ("sunny", "cloudy", "rainy")
...
n_steps = 20000
states = [0]
for i in range(n_steps):
    states.append(np.random.choice((0, 1, 2), p=transition_matrix[states[-1]]))
states = np.array(states)
...
offsets = range(1, n_steps, 5)
for i, label in enumerate(state_space):
    # here the frequency of being in a given state is summed and normalized;
    # this is exactly \pi(x)
    ax.plot(offsets,
            [np.sum(states[:offset] == i) / offset for offset in offsets],
            label=label)
The above gives one distributional answer to the question "what is the system state after a certain time t?"
A geometric
(refer to the frequency of being in state j) In order to sample from a distribution
For didactic purposes, let’s for now consider both a discrete state space and discrete “time”. The key quantity characterizing a Markov chain is the transition operator
a transition matrix T; the state space can also be continuous, in which case T would be a transition kernel. Here we consider continuous distributions, but all concepts presented transfer to the discrete case.
If we could design the transition kernel in such a way that the next state is already drawn from π, we would be done, as our Markov chain would… well… immediately sample from π. Unfortunately, to do this, we need to be able to sample from π, which we can’t.
Initialise
For
If
else
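The Initialise/For/If/else skeleton above is the Metropolis–Hastings accept/reject loop. A minimal random-walk version (symmetric Gaussian proposal, so the acceptance ratio reduces to π(x')/π(x)):

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis: propose x' ~ N(x, step^2); accept with
    probability min(1, pi(x')/pi(x)); otherwise keep the current state."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = np.empty(n_steps)
    for t in range(n_steps):
        x_prop = x + step_size * rng.standard_normal()
        # the "If" branch: accept when log u < log pi(x') - log pi(x)
        if np.log(rng.uniform()) < log_pi(x_prop) - log_pi(x):
            x = x_prop
        # the "else" branch: stay at x (the repeated state still counts)
        samples[t] = x
    return samples
```

Targeting a known density such as N(3, 1) via its unnormalized log-density, the chain's long-run sample mean and standard deviation recover the target's moments.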
Geometric Series -- from Wolfram MathWorld
series
is given by
Multiplying both sides by
and subtracting (3) from (2) then gives
so
For
Similarly, if the sums are taken starting at
the latter of which is valid for
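Since the equations of the derivation above were lost in extraction, here is the standard form it sketches (with a as the first term and r as the common ratio, symbols assumed):

```latex
S_n = \sum_{k=0}^{n-1} a r^k, \qquad
r S_n = \sum_{k=1}^{n} a r^k, \qquad
S_n - r S_n = a\left(1 - r^n\right)
\;\Rightarrow\;
S_n = a\,\frac{1 - r^n}{1 - r}, \qquad
\sum_{k=0}^{\infty} a r^k = \frac{a}{1 - r} \quad (|r| < 1).
```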
So far, we've seen a range of different algorithms
With supervised learning algorithms - performance is pretty similar
What matters more is often:
The amount of training data
Skill of applying algorithms
One final supervised learning algorithm that is widely used - support vector machine (SVM)
Compared to both logistic regression and neural networks, a SVM sometimes gives a cleaner way of learning non-linear functions
Later in the course we'll do a survey of different supervised learning algorithms
An alternative view of logistic regression
Start with logistic regression, see how we can modify it to get the SVM
As before, the logistic regression hypothesis is as follows
And the sigmoid activation function looks like this
In order to explain the math, we use z as defined above
What do we want logistic regression to do?
We have an example where y = 1
Then we hope
With 
Similarly, when y = 0
Then we hope h_(θ)(x) is close to 0
With h_(θ)(x) close to 0, (θ^(T) x) must be much less than 0
This is our classic view of logistic regression
Let's consider another way of thinking about the problem
Alternative view of logistic regression
If you look at cost function, each example contributes a term like the one below to the overall cost function
For the overall cost function, we sum over all the training examples using the above function, and have a 1/m term
If you then plug in the hypothesis definition (h_(θ)(x)), you get an expanded cost function equation;
So each training example contributes that term to the cost function for logistic regression
If y = 1 then only the first term in the objective matters
If we plot the functions vs. z we get the following graph
This plot shows the cost contribution of an example when y = 1 given z
So if z is big, the cost is low - this is good!
But if z is 0 or negative the cost contribution is high
This is why, when logistic regression sees a positive example, it tries to set θ^(T) x to be a very large term
If y = 0 then only the second term matters
We can again plot it and get a similar graph
Same deal, if z is small then the cost is low
But if z is large then the cost is massive
SVM cost functions from logistic regression cost functions
To build a SVM we must redefine our cost functions
When y = 1
Take the y = 1 function and create a new cost function
Instead of a curved line create two straight lines (magenta) which acts as an approximation to the logistic regression y = 1 function
Take point (1) on the z axis
Flat from 1 onwards
Grows linearly once z drops below 1
This means we have two straight lines
Flat when cost is 0
Straight growing line after 1
So this is the new y=1 cost function
Gives the SVM a computational advantage and an easier optimization problem
We call this function cost₁(z)
Similarly
When y = 0
Do the equivalent with the y=0 function plot
We call this function cost₀(z)
So here we define the two cost function terms for our SVM graphically
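As code, these two piecewise-linear cost terms are just clipped linear functions (a sketch, with names matching the lecture's cost₁/cost₀):

```python
import numpy as np

def cost1(z):
    """SVM cost for y = 1: zero once z >= 1, grows linearly below that."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """SVM cost for y = 0: zero once z <= -1, grows linearly above that."""
    return np.maximum(0.0, 1.0 + z)
```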
How do we implement this?
The complete SVM cost function
As a comparison/reminder we have logistic regression below
If this looks unfamiliar it's because we previously had the - sign outside the expression
For the SVM we take our two logistic regression y=1 and y=0 terms described previously and replace with
cost₁(θ^(T) x)
cost₀(θ^(T) x)
So we get
SVM notation is slightly different
In convention with SVM notation we rename a few things here
1) Get rid of the 1/m terms
This is just a slightly different convention
By removing 1/m we should get the same optimal values for θ
1/m is a constant, so should get same optimization
e.g. say you have a minimization problem which minimizes to u = 5
If you multiply your cost function by a constant, the same value of u still minimizes it
The minimal cost itself is different, but that's irrelevant - we only care about the minimizing value
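A quick numerical check of this point (the toy cost function is made up):

```python
import numpy as np

u = np.linspace(0, 10, 1001)
cost = (u - 5) ** 2          # toy cost, minimized at u = 5
scaled = 100 * cost          # multiply by a constant

# Same argmin, different minimal value
print(u[np.argmin(cost)], u[np.argmin(scaled)])
```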
2) For logistic regression we had two terms;
Training data set term (i.e. that we sum over m) = A
Regularization term (i.e. that we sum over n) = B
So we could describe it as A + λB
Need some way to deal with the trade-off between regularization and data set terms
Set different values for λ to parametrize this trade-off
Instead of parameterization this as A + λB
For SVMs the convention is to use a different parameter called C
So do CA + B
If C were equal to 1/λ then the two functions (CA + B and A + λB) would give the same value
So, our overall equation is
Unlike logistic regression, h_(θ)(x) doesn't give us a probability, but instead we get a direct prediction of 1 or 0
So if θ^(T) x is equal to or greater than 0 --> h_(θ)(x) = 1
Else --> h_(θ)(x) = 0
Sometimes people refer to SVM as large margin classifiers
We'll consider what that means and what an SVM hypothesis looks like
The SVM cost function is as above, and we've drawn out the cost terms below
Left is cost₁ and right is cost₀
What does it take to make terms small
If y =1
cost₁(z) = 0 only when z >= 1
If y = 0
cost₀(z) = 0 only when z <= -1
Interesting property of SVM
If you have a positive example, you only really need z to be greater or equal to 0
If this is the case then you predict 1
SVM wants a bit more than that - doesn't want to *just* get it right, but have the value be quite a bit bigger than zero
Throws in an extra safety margin factor
Logistic regression does something similar
What are the consequences of this?
Consider a case where we set C to be huge
C = 100,000
So considering we're minimizing CA + B
If C is huge we're going to pick an A value so that A is equal to zero
What is the optimization problem here - how do we make A = 0?
Making A = 0
If y = 1
Then to make our "A" term 0 need to find a value of θ so (θ^(T) x) is greater than or equal to 1
Similarly, if y = 0
Then we want to make "A" = 0 then we need to find a value of θ so (θ^(T) x) is equal to or less than -1
So - if we think of our optimization problem a way to ensure that this first "A" term is equal to 0, we re-factor our optimization problem into just minimizing the "B" (regularization) term, because
When A = 0 --> A*C = 0
So we're minimizing B, under the constraints shown below
Turns out when you solve this problem you get interesting decision boundaries
The green and magenta lines are functional decision boundaries which could be chosen by logistic regression
But they probably don't generalize too well
The black line, by contrast, is the one chosen by the SVM because of the safety margin imposed by the optimization problem
More robust separator
Mathematically, that black line has a larger minimum distance (margin) from any of the training examples
By separating with the largest margin you incorporate robustness into your decision making process
We looked at this when C is very large
SVM is more sophisticated than this large margin view might suggest
If you were just using large margin then SVM would be very sensitive to outliers
A single outlier could then hugely impact your classification boundary
A single example might not represent a good reason to change an algorithm
If C is very large then we do use this quite naive maximize-the-margin approach
So we'd change the black to the magenta
But if C is reasonably small, or a not too large, then you stick with the black decision boundary
What about non-linearly separable data?
Then SVM still does the right thing if you use a normal size C
So the idea of SVM being a large margin classifier is only really relevant when you have no outliers and the data is easily linearly separable
Means we ignore a few outliers
Vector inner products
Have two (2D) vectors u and v - what is the inner product (u^(T) v)?
Plot u on graph
i.e u₁ vs. u₂
One useful property is the norm of a vector
Written as ||u||
This is the euclidean length of vector u
So ||u|| = SQRT(u₁² + u₂²) = real number
i.e. length of the arrow above
Can show via Pythagoras
For the inner product, take v and orthogonally project down onto u
First we can plot v on the same axis in the same way (v₁ vs. v₂)
Measure the length/magnitude of the projection
So here, the green line is the projection
p = length along u to the intersection
p is the magnitude of the projection of vector v onto vector u
Possible to show that
u^(T) v = p * ||u||
So this is one way to compute the inner product
u^(T) v = u₁v₁+ u₂v₂
So therefore
p * ||u|| = u₁v₁+ u₂v₂
This is an important rule in linear algebra
We can reverse this too
So we could do
v^(T) u = v₁u₁+ v₂u₂
Which would obviously give you the same number
p can be negative if the angle between them is more than 90 degrees
So here p is negative
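A quick numerical check of the identity u^(T) v = p * ||u|| (the vectors here are arbitrary):

```python
import numpy as np

u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

# p = length of the orthogonal projection of v onto u
p = np.dot(u, v) / np.linalg.norm(u)

# The two ways of computing the inner product agree
print(p * np.linalg.norm(u))  # equals u^T v
print(np.dot(u, v))
```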
Use the vector inner product theory to try and understand SVMs a little better
SVM decision boundary
For the following explanation - two simplification
Set θ₀= 0 (i.e. ignore intercept terms)
Set n = 2 - (x₁, x₂)
i.e. each example has only 2 features
Given we only have two parameters we can simplify our function to
And, can be re-written as 
Should give same thing
We may notice that
The term in red is the norm of θ
If we take θ as a 2x1 vector
If we assume θ₀ = 0 it's still true
So, finally, this means our optimization function can be re-defined as 
So the SVM is minimizing the squared norm
Given this, what are the (θ^(T) x) parameters doing?
Given θ and given example x what is this equal to
We can look at this in a comparable manner to how we just looked at u and v
Say we have a single positive training example (red cross below)
Although we haven't been thinking about examples as vectors it can be described as such
Now, say we have our parameter vector θ and we plot that on the same axis
The next question is what is the inner product of these two vectors
p is in fact p^(i), because it's the projection length for example i
Given our previous discussion we know
(θ^(T) x^(i )) = p^(i )* ||θ||
= θ₁x^(i)₁ + θ₂x^(i)₂
So these are both equally valid ways of computing θ^(T) x^(i)
What does this mean?
The constraints we defined earlier
(θ^(T) x) >= 1 if y = 1
(θ^(T) x) <= -1 if y = 0
Can be replaced/substituted with the constraints
p^(i )* ||θ|| >= 1 if y = 1
p^(i )* ||θ|| <= -1 if y = 0
Writing that into our optimization objective 
So, given we've redefined these functions let us now consider the training example below
Given this data, what boundary will the SVM choose? Note that we're still assuming θ₀ = 0, which means the boundary has to pass through the origin (0,0)
Green line - small margins
SVM would not choose this line
Decision boundary comes very close to examples
Let's discuss why the SVM would not choose this decision boundary
Looking at this line
We can show that θ is at 90 degrees to the decision boundary
θ is always at 90 degrees to the decision boundary (can show with linear algebra, although we're not going to!)
So now lets look at what this implies for the optimization objective
Look at first example (x¹)
Project a line from x¹ onto the θ vector (so it hits at 90 degrees)
The distance between the intersection and the origin is (p¹)
Similarly, look at second example (x²)
Project a line from x² onto the θ vector
This is the magenta line, which will be negative (p²)
If we overview these two lines below we see a graphical representation of what's going on;
We find that both these p values are going to be pretty small
If we look back at our optimization objective
We know we need p¹ * ||θ|| to be bigger than or equal to 1 for positive examples
If p is small
Means that ||θ|| must be pretty large
Similarly, for negative examples we need p² * ||θ|| to be smaller than or equal to -1
We saw in this example p² is a small negative number
So ||θ|| must be a large number
Why is this a problem?
The optimization objective is trying to find a set of parameters where the norm of theta is small
So this doesn't seem like a good direction for the parameter vector (because as p values get smaller ||θ|| must get larger to compensate)
So we should make p values larger which allows ||θ|| to become smaller
So let's choose a different boundary
Now if you look at the projection of the examples to θ we find that p¹ becomes large and ||θ|| can become small
So with some values drawn in
This means that by choosing this second decision boundary we can make ||θ|| smaller
Which is why the SVM chooses this hypothesis as better
This is how we generate the large margin effect
The magnitude of this margin is a function of the p values
So by maximizing these p values we minimize ||θ||
Finally, we did this derivation assuming θ₀ = 0,
If this is the case we're entertaining only decision boundaries which pass through (0,0)
If you allow θ₀ to be other values then this simply means you can have decision boundaries which cross through the x and y values at points other than (0,0)
Can show with basically the same logic that even when θ₀ is non-zero, with the optimization objective described above (when C is very large), the SVM is still looking for a large margin separator between the classes
What are kernels and how do we use them
We have a training set
We want to find a non-linear boundary
Come up with a complex set of polynomial features to fit the data
Have h_(θ)(x) which
Returns 1 if the combined weighted sum of features (weighted by the parameter vector) is greater than or equal to 0
Else return 0
Another way of writing this (new notation) is
That a hypothesis computes a decision boundary by taking the sum of the parameter vector multiplied by a new feature vector f, which simply contains the various high order x terms
e.g.
h_(θ)(x) = θ₀+ θ₁f₁+ θ₂f₂ + θ₃f₃
Where
f₁= x₁
f₂ = x₁x₂
f₃ = ...
i.e. not specific values, but each of the terms from your complex polynomial function
Is there a better choice of feature f than the high order polynomials?
As we saw with computer imaging, high order polynomials become computationally expensive
New features
Define three features in this example (ignore x₀)
Have a graph of x₁ vs. x₂ (don't plot the values, just define the space)
Pick three points in that space
These points l¹, l², and l³, were chosen manually and are called landmarks
Given x, define f1 as the similarity between (x, l¹)
= exp(- (|| x - l¹ ||² ) / 2σ²)
|| x - l¹ ||² is the squared euclidean distance between the point x and the landmark l¹
Discussed more later
If we remember our statistics, we know that
σ is the standard deviation
σ² is commonly called the variance
Remember, that as discussed
So, f2 is defined as
f2 = similarity(x, l²) = exp(- (|| x - l² ||² ) / 2σ²)
And similarly
f3 = similarity(x, l³) = exp(- (|| x - l³ ||² ) / 2σ²)
This similarity function is called a kernel
This function is a Gaussian Kernel
So, instead of writing similarity between x and l we might write
f1 = k(x, l¹)
Diving deeper into the kernel
So lets see what these kernels do and why the functions defined make sense
Say x is close to a landmark
Then the squared distance will be ~0
So
Which is basically e⁻⁰
Which is close to 1
Say x is far from a landmark
Then the squared distance is big
Gives e^(-large number)
Which is close to zero
Each landmark defines a new feature
If we plot f1 vs the kernel function we get a plot like this
Notice that when x = [3,5] then f1 = 1
As x moves away from [3,5] then the feature takes on values close to zero
So this measures how close x is to this landmark
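A minimal sketch of the Gaussian kernel as a function (the landmark [3, 5] matches the plot above; the helper name is mine):

```python
import numpy as np

def gaussian_kernel(x, l, sigma_sq=1.0):
    """Similarity between x and landmark l: 1 at the landmark, -> 0 far away."""
    x, l = np.asarray(x, dtype=float), np.asarray(l, dtype=float)
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma_sq))

l1 = [3.0, 5.0]
print(gaussian_kernel([3, 5], l1))    # exactly 1 at the landmark
print(gaussian_kernel([10, 10], l1))  # close to 0 far away
```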
What does σ do?
σ² is a parameter of the Gaussian kernel
Defines the steepness of the rise around the landmark
Above example σ² = 1
Below σ² = 0.5
We see here that as you move away from 3,5 the feature f1 falls to zero much more rapidly
The opposite effect is seen if σ² = 3
Given this definition, what kinds of hypotheses can we learn?
With training examples x we predict "1" when
θ₀+ θ₁f₁+ θ₂f₂ + θ₃f₃ >= 0
For our example, let's say we've already run an algorithm and got the following parameter values
θ₀ = -0.5
θ₁ = 1
θ₂ = 1
θ₃ = 0
Given our placement of the three landmarks, what happens if we evaluate an example at the magenta dot below?
Looking at our formula, we know f1 will be close to 1, but f2 and f3 will be close to 0
So if we look at the formula we have
θ₀+ θ₁f₁+ θ₂f₂ + θ₃f₃ >= 0
-0.5 + 1 + 0 + 0 = 0.5
0.5 is greater than or equal to 0, so we predict y = 1
If we had another point far away from all three
This equates to -0.5
So we predict 0
Considering our parameters, for points near l¹ and l² you predict 1, but for points near l³ you predict 0
Which means we create a non-linear decision boundary that goes a lil' something like this;
Inside we predict y = 1
Outside we predict y = 0
So this shows how we can create a non-linear boundary with landmarks and the kernel function in the support vector machine
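The worked example above can be sketched in code; the landmark coordinates here are made up, but the θ values are the ones from the example:

```python
import numpy as np

def gaussian_kernel(x, l, sigma_sq=1.0):
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma_sq))

# Hypothetical landmark positions (the lecture only shows them on a plot)
landmarks = [np.array([2.0, 2.0]), np.array([4.0, 2.0]), np.array([2.0, -2.0])]
theta = np.array([-0.5, 1.0, 1.0, 0.0])  # theta_0 .. theta_3 from the example

def predict(x):
    f = np.array([1.0] + [gaussian_kernel(x, l) for l in landmarks])
    return 1 if theta @ f >= 0 else 0

print(predict(np.array([2.0, 2.0])))    # at l1 -> predicts 1
print(predict(np.array([20.0, 20.0])))  # far from all landmarks -> predicts 0
```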
But
How do we get/choose the landmarks
What other kernels can we use (other than the Gaussian kernel)
Filling in missing detail and practical implications regarding kernels
Spoke about picking landmarks manually, defining the kernel, and building a hypothesis function
Where do we get the landmarks from?
For complex problems we probably want lots of them
Choosing the landmarks
Take the training data
For each example place a landmark at exactly the same location
So end up with m landmarks
One landmark at the location of each training example
Means our features measure how close to a training set example something is
Given a new example, compute all the f values
Gives you a feature vector f (f₀ to f_(m))
f₀ = 1 always
A more detailed look at generating the f vector
For a training example (x^(i), y^(i)), the features we compute use x^(i)
So we just cycle through each landmark, calculating how close to that landmark actually x^(i) is
f₁^(i), = k(x^(i), l¹)
f₂^(i), = k(x^(i), l²)
...
f_(m)^(i), = k(x^(i), l^(m))
Somewhere in the list we compare x to itself... (i.e. when we're at f_(i)^(i))
So because we're using the Gaussian Kernel this evaluates to 1
Take these m features (f₁, f₂ ... f_(m)) and group them into an [m+1 x 1] dimensional vector called f
f^(i) is the f feature vector for the ith example
And add a 0th term = 1
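Putting the landmark-per-example recipe together, a sketch of building the f vector (the helper names are mine):

```python
import numpy as np

def gaussian_kernel(x, l, sigma_sq=1.0):
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma_sq))

def feature_vector(x, landmarks, sigma_sq=1.0):
    """Map x to f = [1, k(x, l1), ..., k(x, lm)]: one landmark per training example."""
    return np.array([1.0] + [gaussian_kernel(x, l, sigma_sq) for l in landmarks])

X = np.array([[1.0, 1.0], [2.0, 0.0], [0.0, 3.0]])  # toy training set, m = 3
landmarks = list(X)  # place a landmark on every training example

f = feature_vector(X[0], landmarks)
print(f.shape)  # (m + 1,)
print(f[1])     # x compared with itself -> exactly 1
```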
Given these kernels, how do we use a support vector machine
SVM hypothesis prediction with kernels
Predict y = 1 if (θ^(T) f) >= 0
Because θ = [m+1 x 1]
And f = [m +1 x 1]
So, this is how you make a prediction assuming you already have θ
How do you get θ?
SVM training with kernels
Use the SVM learning algorithm
Now, we minimize using f as the feature vector instead of x
By solving this minimization problem you get the parameters for your SVM
In this setup, m = n
Because number of features is the number of training data examples we have
One final mathematical detail (not crucial to understand)
If we ignore θ₀ then the following is true
What many implementations do is 
Where the matrix M depends on the kernel you use
Gives a slightly different minimization - means we determine a rescaled version of θ
Allows more efficient computation, and scale to much bigger training sets
If you have a training set with 10 000 values, means you get 10 000 features
Solving for all these parameters can become expensive
So by adding this in we avoid a for loop and use a matrix multiplication algorithm instead
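One way to see the loop-vs-matrix point: all m² pairwise kernel values can be computed in a single broadcasted operation rather than nested loops. A sketch of the general idea (not any particular package's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # toy training set, m = 100
sigma_sq = 1.0

# Pairwise squared distances via broadcasting, then the full kernel (Gram) matrix
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma_sq))

print(K.shape)  # (m, m): K[i, j] = k(x_i, x_j)
```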
You can apply kernels to other algorithms
But they tend to be very computationally expensive
But the SVM is far more efficient - so more practical
Lots of good off the shelf software to minimize this function
SVM parameters (C)
Bias and variance trade off
Must choose C
C plays a role similar to 1/λ (where λ is the regularization parameter)
Large C gives a hypothesis of low bias high variance --> overfitting
Small C gives a hypothesis of high bias low variance --> underfitting
SVM parameters (σ²)
Parameter for calculating f values
Large σ² - f features vary more smoothly - higher bias, lower variance
Small σ² - f features vary abruptly - low bias, high variance
So far spoken about SVM in a very abstract manner
What do you need to do this
Use SVM software packages (e.g. liblinear, libsvm) to solve parameters θ
Need to specify
Choice of parameter C
Choice of kernel
Choosing a kernel
We've looked at the Gaussian kernel
Need to define σ (σ²)
Discussed σ²
When would you choose a Gaussian?
If n is small and/or m is large
e.g. 2D training set that's large
If you're using a Gaussian kernel then you may need to implement the kernel function
e.g. a function
fi = kernel(x1,x2)
Returns a real number
Some SVM packages will expect you to define kernel
Although, some SVM implementations include the Gaussian and a few others
Gaussian is probably most popular kernel
NB - make sure you perform feature scaling before using a Gaussian kernel
If you don't, features with a large range of values will dominate the f value
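A toy illustration of why scaling matters before a Gaussian kernel (the feature ranges are made up):

```python
import numpy as np

def gaussian_kernel(x, l, sigma_sq=1.0):
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma_sq))

# Feature 1 in thousands (e.g. house size), feature 2 small (e.g. # bedrooms)
x = np.array([2000.0, 3.0])
l = np.array([1000.0, 1.0])

# Unscaled: the distance (and hence f) is dominated entirely by the first feature
print(gaussian_kernel(x, l))  # underflows to 0

# After scaling both features to comparable ranges, both contribute
scale = np.array([1000.0, 1.0])
print(gaussian_kernel(x / scale, l / scale))
```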
Could use no kernel - linear kernel
Predict y = 1 if (θ^(T) x) >= 0
So no f vector
Get a standard linear classifier
Why do this?
If n is large and m is small then
Lots of features, few examples
Not enough data - risk overfitting in a high dimensional feature-space
Other choice of kernel
Linear and Gaussian are most common
Not all similarity functions you develop are valid kernels
Must satisfy Mercer's Theorem
SVMs use numerical optimization tricks
Means certain optimizations can be made, but the kernel must satisfy the theorem for them to apply
Polynomial Kernel
We measure the similarity of x and l by doing one of
(x^(T) l)²
(x^(T) l)³
(x^(T) l+1)³
General form is
(x^(T) l+Con)^(D)
If they're similar then the inner product tends to be large
Not used that often
Two parameters
Degree of polynomial (D)
Constant added to (x^(T) l) (Con)
Usually performs worse than the Gaussian kernel
Used when x and l are both non-negative
String kernel
Used if input is text strings
Use for text classification
Chi-squared kernel
Histogram intersection kernel
Multi-class classification for SVM
Many packages have built in multi-class classification packages
Otherwise use one-vs all method
Not a big issue
Logistic regression vs. SVM
When should you use SVM and when is logistic regression more applicable
If n (number of features) is large relative to m (training set size)
e.g. text classification problem
Feature vector dimension is 10 000
Training set is 10 - 1000
Then use logistic regression or SVM with a linear kernel
If n is small and m is intermediate
n = 1 - 1000
m = 10 - 10 000
Gaussian kernel is good
If n is small and m is large
n = 1 - 1000
m = 50 000+
SVM will be slow to run with Gaussian kernel
In that case
Manually create or add more features
Use logistic regression or SVM with a linear kernel
Logistic regression and SVM with a linear kernel are pretty similar
Do similar things
Get similar performance
A lot of SVM's power is using different kernels to learn complex non-linear functions
For all these regimes a well designed NN should work
But, for some of these problems a NN might be slower - SVM well implemented would be faster
SVM has a convex optimization problem - so you get a global minimum
It's not always clear how to choose an algorithm
Often more important to get enough data
Designing new features
Debugging the algorithm
SVM is widely perceived as a very powerful learning algorithm
The weights are simply the coefficients that define a separating plane. For the moment, forget about neurons and just consider the geometric definition of a plane in N dimensions:
w1*x1 + w2*x2 + ... + wN*xN - w0 = 0
You can also think of this as being a dot product:
w*x - w0 = 0
where w and x are both length-N vectors. This equation holds for all points on the plane. Recall that we can multiply the above equation by a constant and it still holds so we can define the constants such that the vector w has unit length. Now, take out a piece of paper and draw your x-y axes (x1 and x2 in the above equations). Next, draw a line (a plane in 2D) somewhere near the origin. w0 is simply the perpendicular distance from the origin to the plane and w is the unit vector that points from the origin along that perpendicular. If you now draw a vector from the origin to any point on the plane, the dot product of that vector with the unit vector w will always be equal to w0 so the equation above holds, right? This is simply the geometric definition of a plane: a unit vector defining the perpendicular to the plane (w) and the distance (w0) from the origin to the plane.
Now our neuron is simply representing the same plane as described above but we just describe the variables a little differently. We'll call the components of x our "inputs", the components of w our "weights", and we'll call the distance w0 a bias. That's all there is to it.
Getting a little beyond your actual question, we don't really care about points on the plane. We really want to know which side of the plane a point falls on. While w*x - w0 is exactly zero on the plane, it will have positive values for points on one side of the plane and negative values for points on the other side. That's where the neuron's activation function comes in but that's beyond your actual question.
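The plane description above, as a quick numeric check (w and w0 are arbitrary, with w chosen unit-length):

```python
import numpy as np

w = np.array([0.6, 0.8])  # unit-length normal to the plane
w0 = 2.0                  # perpendicular distance from origin to plane

def side(x):
    """Zero on the plane, positive on one side, negative on the other."""
    return np.dot(w, x) - w0

print(side(np.array([1.2, 1.6])))  # a point on the plane: 0.6*1.2 + 0.8*1.6 = 2
print(side(np.array([3.0, 3.0])))  # positive side
print(side(np.array([0.0, 0.0])))  # origin: -w0
```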